Finite Size Corrections and Likelihood Ratio Fluctuations in the Spiked Wigner Model
Abstract
In this paper we study principal components analysis in the regime of high dimensionality and high noise. Our model of the problem is a rank-one deformation of a Wigner matrix where the signal-to-noise ratio (SNR) is of constant order, and we are interested in the fundamental limits of detection of the spike. Our main goal is to gain a fine understanding of the asymptotics for the log-likelihood ratio process, also known as the free energy, as a function of the SNR. Our main results are twofold. We first prove that the free energy has a finite-size correction to its limit—the replica-symmetric formula—which we explicitly compute. This provides a formula for the Kullback-Leibler divergence between the planted and null models. Second, we prove that below the reconstruction threshold, where it becomes impossible to reconstruct the spike, the log-likelihood ratio has fluctuations of constant order and converges in distribution to a Gaussian under both the planted and (under restrictions) the null model. As a consequence, we provide a general proof of contiguity between these two distributions that holds up to the reconstruction threshold, and is valid for an arbitrary separable prior on the spike. Formulae for the total variation distance, and the Type-I and Type-II errors of the optimal test are also given. Our proofs are based on Gaussian interpolation methods and a rigorous incarnation of the cavity method, as devised by Guerra and Talagrand in their study of the Sherrington-Kirkpatrick spin-glass model.
1 Introduction
Spiked models, which are distributions over matrices of the form “signal + noise,” have been a mainstay in the statistical literature since their introduction by Johnstone, (2001) as models for the study of high-dimensional principal component analysis that are tractable yet realistic. Spectral properties of these models have been extensively studied, in particular in random matrix theory, where they are known as deformed ensembles (Péché, , 2014). Landmark investigations in this area (Baik et al., , 2005; Baik and Silverstein, , 2006; Péché, , 2006; Féral and Péché, , 2007; Capitaine et al., , 2009) have established the existence of a spectral threshold above which the top eigenvalue detaches from the bulk of eigenvalues and becomes informative about the spike, and below which the top eigenvalue bears no information. Estimation using the top eigenvector undergoes the same transition, where it is known to “lose track” of the spike below the spectral threshold (Paul, , 2007; Nadler, , 2008; Johnstone and Lu, , 2009; Benaych-Georges and Nadakuditi, , 2011). Although these spectral analyses have provided many insights, as have analyses based on more thoroughgoing usage of spectral data and/or more advanced optimization-based procedures (see Amini and Wainwright, , 2009; Berthet and Rigollet, , 2013; Onatski et al., , 2013, 2014; Dobriban, , 2017, and references therein), they stop short of characterizing the fundamental limits of estimating the spike, or detecting its presence from the observation of a sample matrix. These questions, information-theoretic and statistical in nature, are more naturally approached by looking at objects such as the posterior law of spike and the associated likelihood ratio process.
The main approach to date to the challenging problem of controlling the likelihood ratio is via the second moment method. Controlling the second moment enables one to show contiguity, in the sense of Le Cam, (1960), between the planted and null models and thus declare impossibility of strong detection—i.e., the impossibility of vanishing Type-I and Type-II errors of any given test—in the region where this second moment is bounded (Banks et al., , 2017; Perry et al., , 2016). This method is known, however, to require careful conditioning and truncation due to the existence of rare but catastrophic events under which the likelihood ratio becomes exponentially large. These events thus dominate the second moment, although they are virtually irrelevant to the detection task. Moreover, even after conditioning the method may fail in identifying the detection thresholds, depending on the structure of the spike. Furthermore, contiguity has little or no bearing on the problem of weak detection: When errors are inevitable, what is the smallest error achievable by any test?
Motivated by a desire to overcome these limitations, we consider a particularly simple spiked model—the rank-one spiked Wigner model—and provide an alternative approach to the detection problem that obviates the use of the second moment method altogether. This is achieved by obtaining asymptotic distributional results for the log-likelihood ratio process, then appealing to standard results from the theory of statistical experiments. We are thereby able to provide solutions to both the strong and weak variants of the detection problem. To study the likelihood ratio in this setting we build on the technology developed by Aizenman, Guerra, Panchenko, Talagrand, and many others, in their study of the Sherrington-Kirkpatrick (SK) spin glass-model. Specifically, we make use of Gaussian interpolation methods and Talagrand’s cavity method.
1.1 Setup and summary of the results
In the spiked Wigner model, one observes a rank-one deformation of a Wigner matrix :
| (1) |
where and are independent for all . The spike vector represents the signal to be recovered, or its presence detected. We assume that the entries of the spike are drawn i.i.d. from a prior distribution on having bounded support. The parameter plays the role of the signal-to-noise ratio, and the scaling by is such that the signal and noise components of the observed data are of comparable magnitudes. This places the problem in a high-noise regime where consistency is not possible but partial recovery still is. As a matter of convenience, we discard the diagonal terms from the observations. (Adding the diagonal back does not pose any additional technical difficulties, and our results can be straightforwardly extended to this case.) We endow the real line with the Borel -algebra and define on it. We denote by the joint probability law of the observations as per (1) and define the likelihood ratio
| (2) |
A simple computation based on conditioning on reveals that
| (3) |
We define the free energy (density) associated with the model to be
| (4) |
We see that , where is the Kullback-Leibler divergence between probability measures. (The free energy is usually defined differently in the literature as the log-normalizing constant in the posterior of given . The two definitions are strictly equivalent.) It was initially argued via heuristic replica and cavity computations (Lesieur et al., , 2015, 2017) that converges to a limit , which is referred to as the replica-symmetric formula. This formula, variational in nature, encodes in principle a full characterization of the limits of estimating the spike with non-trivial accuracy. Indeed, various formulae for other information-theoretic quantities can be deduced from it, including the mutual information between and , the minimal mean squared error of estimating based on , and the overlap of a draw from the posterior with the spike . Most of these claims have subsequently been proved rigorously in a series of papers (Deshpande and Montanari, , 2014; Deshpande et al., , 2016; Barbier et al., , 2016; Krzakala et al., , 2016; Lelarge and Miolane, , 2016) under various assumptions on the prior. However, these results stop short of providing explicit characterizations of thresholds for the detection problem.
The main goal of this paper is to gain a more refined understanding of the asymptotic behavior of the log-likelihood ratio , and its mean , under as becomes large. We first determine the finite-size correction of to its limit : we prove (under conditions on ) that converges to a limit with rate . Besides providing an explicit rate of convergence of to its limit, this result translates into a formula for the Kullback-Leibler divergence , which is particularly interesting below the reconstruction threshold: we will see that in this regime , so ceases to be extensive in the size of the system and converges to a finite value .
Second, we prove that in this same regime, the log-likelihood ratio has fluctuations of constant order under , and converges asymptotically to a Gaussian with a mean equal to half the variance. This allows us to provide an alternative proof of contiguity between and , valid in the entire regime where contiguity can possibly hold, as well as a formula for the Type-II error for testing between these two distributions.
Under the null distribution on the other hand, the model is equivalent to the widely studied Sherrington-Kirkpatrick model (provided that ). In one of the first rigorous results on this model, Aizenman et al., (1987) proved that in the high-temperature regime and in the absence of an external field, the fluctuations of the log-partition function of the model about its mean, which is given by the “annealed” computation, are asymptotically Gaussian with explicit mean and variance. By mapping their result into our setting, we obtain the fluctuations of under in the non-reconstruction phase, as well as a formula for the Type-I error. Although we only obtain this last formula for the Rademacher prior, we conjecture its validity for arbitrary priors. An interesting symmetry emerges from these results: the limiting Gaussians under and have means of equal magnitude and opposite signs, and equal variances. This symmetry causes the Type-I and Type-II errors to be equal. Adding up the two latter quantities, we obtain a formula for the total variation distance .
Our results are in the spirit of those of Onatski et al., (2013, 2014), who studied the likelihood ratio of the joint eigenvalue densities under the spiked covariance model with a sphericity prior, and showed its asymptotic normality below the spectral threshold. Their results thus pertain to eigenvalue-based tests while ours are for arbitrary tests, albeit for a simpler model. Our results show in particular that it is still possible to distinguish the planted model from the null model with non-vanishing probability below the reconstruction threshold; i.e., even when estimation of the spike becomes impossible. Performing such a test in practice of course hinges on the computational problem of efficiently computing the likelihood ratio; we leave open this question of constructing computationally efficient tests in the non-reconstruction phase.111If one observes the diagonal, then one can test using the trace of . This test would still however be suboptimal.
1.2 Background
The formula.
For , consider the function
| (5) |
where , and . This is the divergence between the distributions of the random variables and . We define the Replica-Symmetric () potential
| (6) |
and finally define the formula
| (7) |
A central result in this context is that free energy converges to the formula for all (Lesieur et al., , 2015, 2017; Deshpande et al., , 2016; Barbier et al., , 2016; Krzakala et al., , 2016; Lelarge and Miolane, , 2016):
The values of that maximize the potential and their properties play an important role in the theory. Lelarge and Miolane, (2016) proved that the map has a unique maximizer for all where is the set of points where the function is differentiable. By convexity of (see next section), . Moreover, they showed that the map is non-decreasing, and
| (8) |
One should interpret the value as the best overlap an estimator based on observing can have with the spike . Indeed, the overlap between the spike and a random draw from the posterior should concentrate in the large limit about (hence the name “replica-symmetry”). A matrix variant of this result (where one estimates ) was proved in (Lelarge and Miolane, , 2016). In Section 3, we prove strong (vector) versions of this result where under mild assumptions, optimal rates of convergence are given.
The reconstruction threshold.
The first limit in (8) shows that when the prior is not centered, it is always possible to have a non-trivial overlap with for any . On the other hand, when the prior has zero mean, and since is a non-decreasing function of , it is useful to define the critical value of below which estimating becomes impossible:
| (9) |
We refer to as the critical or reconstruction threshold. The next lemma establishes a natural bound on .
Lemma 1.
We have
| (10) |
Proof.
Indeed, assume that is centered, and let . Since and , we see that and . So cannot be a maximizer of . Therefore and .
The importance of Lemma 1 stems from the fact that the value is the spectral threshold previously discussed. Above this value, the first eigenvalue of the matrix leaves the bulk, and is at the edge of the bulk below it (Péché, , 2006; Capitaine et al., , 2009; Féral and Péché, , 2007). This value also marks the limit below which the first eigenvector of captures no information about the spike (Benaych-Georges and Nadakuditi, , 2011). Inequality (10) can be strict or turn into equality depending on the prior . For instance, there is equality if the prior is Gaussian or Rademacher—so that the first eigenvector overlaps with the spike as soon as estimation becomes possible at all—and strict inequality in the case of the (sufficiently) sparse Rademacher prior . More precisely, there exists a value
such that for , and for . In the latter case, the spectral approach to estimating fails for , and it is believed that no polynomial time algorithm succeeds in this region (Lesieur et al., , 2015; Krzakala et al., , 2016; Banks et al., , 2017).
2 Main results
2.1 Finite size corrections to the formula
The results we are about to present hold in a possibly slightly smaller set than . While uniqueness of only needs first differentiability of the formula, our results need a second derivative to exist. In physics parlance, our results do not hold at values of at which a particular kind of first-order phase transition occurs, namely, one in which the order parameter is not differentiable. The presence of these transitions depends again on the prior . For the Gaussian and Rademacher prior, there are no such transitions, while for the sparse Rademacher prior discussed above, there is one first-order transition where is not defined for every . Thus we define the set
Since is the point-wise limit of a sequence of convex functions, it is also convex. Then by Alexandrov’s theorem (Aleskandrov, , 1939), the set is of full Lebesgue measure in (cf. .) Moreover, we can see that , since if , we have , therefore . By continuity, vanishes on the entire interval . Our first main result is to establish the existence of a function defined on such that either below or above it when the prior is not symmetric about the origin, we have
An explicit formula for will be given. But first we need to introduce some notation. Let and consider the quantities
| (11) |
where
with and the expectation operator is w.r.t. and . The Gibbs measure can be interpreted as the posterior distribution of given the observation . (More on this point of view in Section 3.) Now let
| (12) |
and finally define
| (13) |
We will prove (Lemma 18) that for all so that this function is well defined on .
Theorem 2.
For , if either , or and the prior is not symmetric about the origin, then
or equivalently, .
The theorem asserts that either below the reconstruction threshold, or above it when the prior is not symmetric, the free energy has a finite-size correction of order to its limit and a subsequent term of order in the expansion. In the case with symmetric prior, the problem is invariant under a sign flip of the spike, so the overlap has a symmetric distribution, and hence concentrates equiprobably about two distinct values . Our techniques do not survive this symmetry, and resolving this case seems to require a new approach.
We see that is an extensive quantity in whenever , or equivalently, . On the other hand, this is of constant order below :
Centered prior.
Let us consider the case where the prior has zero mean, and unit variance (the latter can be assumed without loss of generality by rescaling ), so that Lemma 1 reads . If , we have , , and one can check that in this case
Therefore, expression (13) simplifies to
By the above calculation, we have a formula for the divergence between and below the reconstruction threshold (see plot in Figure 1):
Corollary 3.
Assume the prior is centered and of unit variance. Then for all ,
| (14) |
More information on .
Expression (13) looks mysterious at first sight. Let us briefly explain its origin. A slightly less processed expression for is the following
after which (13) follows by simple integration. The integrand in the above expression is obtained, as we will show, as the first entry of the solution of the linear system
where and is the “cavity” matrix
The above matrix happens to have two eigenvalues which are exactly and . The matrix and the above linear system will emerge naturally as a result of the cavity method. On the other hand, the integral over the time parameter is along an interpolation path invented by Guerra, (2001), (see also Guerra and Toninelli, 2002b, ) in the context of the Sherrington–Kirkpatrick model, and the integrand can be interpreted as the asymptotic variance in a central limit theorem satisfied by the overlap between two “replicas” under the law induced by a certain interpolating Gibbs measure. A definition of these notions with the corresponding results can be found in Sections 3 and 4. The full execution of the cavity method is relegated to Section 5.
2.2 Fluctuations below the reconstruction threshold
Corollary 14 asserts that below the reconstruction threshold, the expectation of the log-likelihood ratio under is of constant order (in ) and is asymptotically equal to . In this section we are interested in the fluctuations of this quantity about its expectation. It can be seen by a standard concentration-of-measure argument that for all , concentrates about its expectation with fluctuations bounded by . While this bound is likely to be of the right order above (this is true for the SK model, see Guerra and Toninelli, 2002a, ), it is very pessimistic below . Indeed, we will show that the fluctuations are of constant order with a Gaussian limiting law in this regime. This phenomenon was noticed early on in the case of the SK model: Aizenman et al., (1987) showed that in the absence of an external field, the log-partition function of this model has (shifted) Gaussian fluctuations about its easily computed “annealed average” in high temperature. We will directly deduce from their result a central limit theorem for under in the case where the prior is Rademacher. Furthermore, a proof by Talagrand, 2011b of their result provided us with a road map for proving a similar result under . We now present our second main result along with consequences for hypothesis testing. For , let
Theorem 4.
Assume the prior is centered and of unit variance.
-
(i)
For all , if then
-
(ii)
Under the additional condition , if , then for all ,
The symbol denotes convergence in distribution as . The formal connection to the SK model and the proof of the above theorem are presented in Section 6.
A remarkable feature is the symmetry between the above two statements. Roughly speaking, this symmetry takes its roots in the fact that the model under the alternative distribution “lives on the Nishimori line”: under , which is the spiked model, the interaction of the spike with a replica creates terms that account for twice the contribution of the interaction between two independent replicas and , and thus flips the sign of the mean from to . This mechanism will become apparent from the proof. Moreover, the fact that the mean is half the variance in these limiting Gaussians has interesting consequences for hypothesis testing. The next subsection is devoted to this problem.
We believe that statement (ii) is still valid up to for a general prior (of zero mean and unit variance). It is possible to prove the convergence up to some value with essentially the same approach as ours, but reaching the optimal threshold seems to require more technical work. In particular, our interpolation bound (see our “main estimate”, section 4.2) has to be significantly improved to deal with this case. Progress will be reported in a future work.
We also point out that similar fluctuation results were recently proved by Baik and Lee, (2016, 2017) for a spherical model where one integrates over the uniform measure on the sphere in the definition of . Their model, due to its integrable nature, is amenable to analysis using tools from random matrix theory. The authors are thus able to also analyze a “low temperature” regime (absent in our problem) where the fluctuations are no longer Gaussian but given by the Tracy-Widom distribution. Their techniques seem to be tied to the spherical case however.
2.2.1 Strong and weak detection below .
Consider the problem of deciding whether an array of observations is likely to have been generated from for a fixed or from . Let us denote by the null hypothesis and the alternative hypothesis. Two formulations of this problem exist: one would like to construct a sequence of measurable tests that returns “0” for and “1” for , for which either
| (15) |
or less stringently, the total mis-classification error, or risk
| (16) |
is minimized among all possible tests . The question of existence of a test that answers to the requirement (15) is referred to as the strong detection problem, and the question of minimizing the criterion (16) is referred to as the weak detection, or simply hypothesis testing problem.
Strong detection.
Using a second moment argument based on the computation of a truncated version of , Banks et al., (2017) and Perry et al., (2016) showed that and are mutually contiguous when , where the latter quantity equals for some priors while it is suboptimal for others (e.g., the sparse Rademacher case, see discussion below). It is easy to see that contiguity implies impossibility of strong detection since for instance, if then . Here we show that Theorem 4 provides a more powerful approach to contiguity:
Corollary 5.
Assume the prior is centered and of unit variance. Then for all , and are mutually contiguous.
Proof.
A consequence of statement (i) in Theorem 4 is that if
under along some subsequence and for some random variable , then by the continuous mapping theorem we necessarily have
We have , and since , we have . We now conclude using Le Cam’s first lemma in both directions (Lemma 6.4 or Example 6.5, Van der Vaart, , 2000).
This approach allows one to circumvent second moment computations which are not guaranteed to be tight in general, and necessitate careful and prior-specific conditioning that truncates away undesirable events.
We note that in the case of the sparse Rademacher prior , contiguity holds for all as soon as by the above corollary, thus closing the gaps in the results of Banks et al., (2017) and Perry et al., (2016). Indeed, as argued below Lemma 1, the reconstruction and spectral thresholds are equal () for all , and differ () below . This implies that strong detection is impossible for and possible otherwise when , while it becomes impossible only below but possible otherwise when .
Weak detection.
We have seen that strong detection is possible if and only if . It is then natural to ask whether weak detection is possible below , i.e., is it possible to test with accuracy better than that of a random guess below the reconstruction threshold? The answer is yes, and this is another consequence of Theorem 4. More precisely, the optimal test minimizing the risk (16) is the likelihood ratio test which rejects the null hypothesis (i.e., returns “1”) if , and its error is
| (17) |
One can readily deduce from Theorem 4 the Type-I and Type-II errors of the likelihood ratio test: for all the Type-II error is
and in the case of the Rademacher prior, the Type-I error is
for all . Here, is the complementary error function. These can be combined into a formula for and the total variation distance between and (see plot in Figure 1):
Corollary 6.
Assume . For all , we have
| (18) |
We similarly conjecture that the formula for Type-I error, hence formula (18), should be correct up to for all (bounded) priors with zero mean and unit variance.


3 Overlap convergence: optimal rates
A crucial component of proving our main results is understanding the rate of convergence of the overlap , where is drawn from , to its limit . By Bayes’ rule, we see that
| (19) |
where is the Hamiltonian
| (20) | ||||
From the formulas (3) and (4), it is straightforward to see that
This provides another way of interpreting as the expected log-partition function (or normalizing constant) of the posterior . For an integer and , we define the Gibbs average of w.r.t. as
| (21) |
This is nothing else that the average of with respect to . The variables are called replicas, and are interpreted as random variables independently drawn from the posterior. When we simply write instead of . Throughout the rest of the manuscript, we use the following notation: for , we let
In this section we show the convergence of the first 4 moments of the overlap at optimal rates under some conditions: if either the prior is not symmetric about the origin or the Hamiltonian is “perturbed” in the following sense. Let and consider the“ interpolating” Hamiltonian (this qualification will become clear in the next section)
| (22) | ||||
where the ’s are i.i.d. standard Gaussian r.v.’s independent of everything else, and . We similarly define the Gibbs average as in (21) where is replaced by . We now state a fundamental property satisfied by both and .
The Nishimori property.
The fact that the Gibbs measure is a posterior distribution (19) has far-reaching consequences. A crucial implication is that the -tuples and have the same law under . This fact, which is a simple consequence of Bayes’ rule (see Proposition 16, Lelarge and Miolane, , 2016) prevents replica-symmetry from breaking (see Korada and Macris, , 2009). In particular, and have the same distribution. This bares the name of the Nishimori property in the spin glass literature (Nishimori, , 2001). Moreover, this property is preserved under the interpolating Gibbs measure for all . Indeed, the interpolation is constructed in such a way that is the posterior distribution of the signal given the augmented set of observations
| (23) |
where one receives side information about from a scalar Gaussian channel, , and the signal-to-noise ratios of the two channels are altered in a time dependent way. Now we state our concentration result.
Theorem 7.
For all and all , there exist constants and such that
| (24) |
Moreover, on , and if either or is not symmetric about the origin, then for some constant . Otherwise, as .
If is symmetric about the origin then the distribution of under is also symmetric, so . If moreover (i.e., ) then (24) becomes trivial at since both sides are constant. On the other hand, if either or is asymmetric, the sign symmetry of the spike is broken. This forces the overlap to be positive and hence concentrate about . Finally, if , and the sign symmetry becomes irrelevant since the overlap converges to zero regardless. Let us mention that in the symmetric unperturbed case (), we expect a variant of (24) to hold where is replaced by its absolute value in the statement, and the upper bound would be . Unfortunately, our methods do not allow us to prove such a statement, but we are able to prove a weaker result (see Lemma 13): for all ,
| (25) |
Although this a minor technical point, we also point out that the estimate in the statement is suboptimal. A heuristic argument allows us to get as , but we are currently unable to rigorously justify it.
MMSE.
The bound (24) can be used to deduce the optimal error of estimating based on the observations (23). The posterior mean is the estimator with Minimal Mean Squared Error (MMSE) among all estimators , and the MMSE is
The last line follows from the Nishimori property, since . Theorem 7 implies in particular (under the conditions of its validity) that , yielding the value of the MMSE. It is in particular possible to estimate the spike from the observations (23) with non-trivial accuracy if and only if . Note that at (no side information) the result still holds below or when the prior is not symmetric. Otherwise, as mentioned before, the problem is invariant under a sign flip of so one has to change the measure of performance. Beside the result (25), we are unable to say much in this situation.
Asymptotic variance.
By Jensen’s inequality we deduce from (24) the convergence of the second moment:
| (26) |
To establish our finite-size correction result (Theorem 2) we need to prove a result stronger than (26), namely that converges to a limit. For and , we let
| (27) |
where and are defined in (12).
Theorem 8.
For all and all , there exist constants and such that
Moreover, on , and if either or is not symmetric about the origin, then for some constant . Otherwise, as .
4 The interpolation method
In this section we present the interpolation method of Guerra, (2001). All our main arguments will rely, in one way or another, on this method. Along the way, we prove Theorem 2. The idea is to construct a continuous interpolation path between the Hamiltonian and a simpler Hamiltonian that decouples all the variables, and analyze the incremental change in the free energy along the path. We present two versions of this method. The first one is the classical method which is applied to the free energy of the entire system, and a second one applied to the free energy of a more restricted system.
4.1 The Guerra interpolation
Our interpolating Hamiltonian is from (22) with for some . Now we consider the interpolating free energy
| (28) |
We see that and . This function is moreover differentiable in , and by differentiation, we have
Now we use Gaussian integration by parts to eliminate the variables and . The details of this computation are explained extensively in many sources. See (Talagrand, 2011a, ; Krzakala et al., , 2016; Lelarge and Miolane, , 2016). We get
Completing the squares yields
| (29) | ||||
The first line in the above expression involves overlaps between two independent replicas, while the second one involves overlaps between one replica and the planted solution. Using the Nishimori property, the derivative of can be written as
| (30) |
The last term follows by symmetry between sites. Now, integrating over , the difference between the free energy and the potential can be written in the form of a sum rule:
| (31) |
We see from (31) that converges to if and only if the overlap concentrates about . This happens only for a value of that maximizes the potential . Using Theorem 7 one can already prove the optimal rate below or above it when the prior is not symmetric. Indeed since is lower-bounded by a positive constant in this case, the bound (26) yields . Also, the second integrand in (31) is bounded by for some constant , so we have for all , . If and the prior is symmetric then we are only able to prove a rate of due to the fact as . The rate would follow immediately in this case if one is able to improve the latter estimate to . To go further, we use Theorem 8, and the additional fact that has a limit:
Lemma 9.
For all and for all , there exist constants and such that
Moreover, on , and if either or is not symmetric about the origin, then for some constant . Otherwise, as .
The proof of Lemma 9 relies on the cavity method, and will be presented in the Section 5. Now we are ready to prove Theorem 2.
Proof of Theorem 2. By formula (31) with the choice , we have
By Theorem 8 and Lemma 9, the integrands on the right-hand side are bounded by where for all in the cases or not symmetric about the origin, so the convergence follows. The function is the second term in the left-hand side. Formula (13) follows by integration.
4.2 The main estimate: energy gap at suboptimal overlap
Recall the interpolating Hamiltonian from (22) with . Let us now introduce the Franz-Parisi potential (Franz and Parisi, , 1995). For and we define
| (32) |
This is the free energy of a subsystem of configurations having an overlap close to a fixed value with the planted signal . It is clear that , where the latter is the interpolating free energy defined in (28). The purpose of this section is to prove that when is far from then there is a sizable gap between and . This estimate is a main ingredient is our proof of overlap concentration. (The other main ingredient is the cavity method, which will be presented in the next section.) To prove this we will need the auxiliary function
One can show that the above formula is the limit of as , for example, by using the so called “Aizenman-Sims-Starr scheme” (Aizenman et al., , 2003); see (Lelarge and Miolane, , 2016). For our purposes we will only need the inequality
| (33) |
which can be proved using the interpolation method presented in the previous section and dropping the non-negative term from the expression analogous to (30) in this case. Now it suffices to compare to . The result is given in Proposition 12, and we finish this subsection by Proposition 13 showing convergence in probability of the overlaps as a straightforward consequence.
For and , we let
| (34) |
We see that , but unlike , the function does not have an interpretation as the between two distributions. The next lemma states some key properties of this function that will be useful later on.
Lemma 10.
For all , it holds that
-
•
The function is strictly convex, hence strongly convex on any compact.
-
•
There exist a constant such that . If then unless the prior is symmetric about the origin (in which case ).
-
•
The map is increasing on .
The proof of the above lemma can be found in the Appendix. We now state a useful interpolation bound on . This is a simpler version of the Guerra-Talagrand interpolation bound at fixed overlap, a key invention that ultimately paved the way towards a proof of the Parisi formula (Guerra, , 2003; Talagrand, , 2006). In some sense, since we are dealing with a planted model, we only need a replica-symmetric version of this bound.
Proposition 11.
Fix , , and . Let , . There exist constants , such that
Proof.
To obtain a bound on for any fixed , we use the interpolation method with Hamiltonian
by varying . The r.v.’s are all i.i.d. standard Gaussians independent of everything else. We define
We compute the derivative w.r.t. . The same algebraic manipulations conducted in the computation of up to (29) apply here, and we get
where is the Gibbs average w.r.t. the Hamiltonian . A few things now happen. Notice that the planted term (first term in the second line) is trivially smaller than due to the overlap restriction. Moreover, the last terms in both lines are of order since the variables are bounded. The first term in the first line, which involves the overlap between two replicas, is more challenging. What makes this term difficult to control is that the Gibbs measure no longer satisfies the Nishimori property due to the overlap restriction, so the overlap between two replicas no longer has the same distribution as the overlap of one replica with the planted spike. Fortunately, this term is always non-positive so we can ignore it altogether and obtain an upper bound:
Integrating over , we get
Now it remains to show that
By properties of the Gaussian distribution, we can write . Define the following (random) probability measure
for all Borel sets . We observe that conditionally on the Gaussian vector and the planted vector , is a product measure due to the additive form of . Moreover,
so will be interested in a large deviation bound on the above quantity. The prior is of bounded support, thus the marginals of (conditional on and ) are clearly sub-Gaussian. Therefore, by concentration of measure, the empirical average must concentrate around its expectation: for all
where is for instance twice the diameter of the support of . This implies
where . Now by Jensen’s inequality ( and are convex), we can write
Since
we have . We now use the elementary inequality , valid for all and , to simplify the above bound and obtain
This allows us to conclude.
A consequence of the above proposition is an energy gap property: if is far from then the free energy of the configurations having overlap with is strictly smaller than :
Proposition 12.
For all , all and all , there exist constants and such that
Moreover, if then . If either or is not symmetric about the origin . Lastly, if and is symmetric, then as , for some .
A direct consequence of the above energy gap result is the convergence in probability of the overlaps:
Proposition 13.
For all , all and all , there exist constants such that
where the constant the same properties as in Proposition 12, except that as . Moreover, if , is symmetric and then one still has
with .
To prove the above lemma, we first show that the partition function of the model enjoys sub-Gaussian concentration in logarithmic scale. This is an elementary consequence of two classical concentration-of-measure results: concentration of Lipschitz functions of Gaussian random variables, and concentration of convex Lipschitz function of bounded random variables. See Boucheron et al., (2013) and van Handel, (2014).
Lemma 14.
Let be a Borel subset of , and define the random variable
There exist a constant depending only on and such that for all ,
Proof of Lemma 14. It suffices to notice that the map is Lipschitz with constant for every , and that the map (where the expectation is over ) is convex and Lipschitz with constant . Then we use concentration of Lipschitz functions of Gaussian r.v.’s and of convex Lipschitz functions of bounded r.v.’s (since the coordinates are bounded).
Proof of Proposition 13. For , we can write the decomposition
where the integer index ranges over a finite set of size since the prior has bounded support. We will only treat the first sum in the above expression since the argument extends trivially to the second sum. Let and write
| (35) |
In virtue of Lemma 14 the two quantities in the above fraction enjoy sub-Gaussian concentration in logarithmic scale. For any given and , we simultaneously have
(the last inequality come from (33)), and
with probability at least . On the complement of this event event, we simply bound the fraction in (35) by 1. Combining the above bounds we have
where with . We let be a function of and as dictated by Proposition 12, and such that for all such that . Now we conclude by letting . Finally, if is symmetric and , then it suffices to consider non-negative values of in the above argument to prove the corresponding statement.
Proof of Propopsition 12. The gap we seek to prove will come from different sources, depending on the particular cases we will look at. The treatment will be split into several nested cases, depending on whether is small or not and whether positive and negative.
Large .
Assume to be determined later. For , Proposition 11 implies
Since is a convex function we have
| (36) |
Since is the unique maximizer of , implies that for some . This in turn implies
where we have chosen such that . The conclusion is reached for . Now we would like to prove the same bound on for negative overlaps. Proposition 11 implies that for ,
| (37) |
If then , and by Proposition 10 we have , and we finish the argument as in the case of positive overlap. Now we deal with the case .
- •
- •
Small .
Assume now that . In this situation, we draw the gap from the term (so far unused) in Proposition 11. The functions and have bounded second derivatives so
Moreover,
Since we have
and here we choose to be accordingly small, and we finish the argument.
Note that the assumption that is not symmetric about the origin is used only in the case where the (negative) overlap is close to . Consequently, the gap is independent of in all cases. Alternatively, without this asymmetry assumption (and when ), we see that there is no hope of a gap independent of since the potential is closer and closer to being even as . But we can still obtain a gap that depends on via a strong convexity argument.
The is strongly convex on any interval, and for all . Therefore, recalling and , there exists a constant depending only on and (this constant is a bound on ) such that
where the last bounds follows from and (recall that this is the only case where such an argument is needed).
5 The cavity method
Now that we have established the convergence in probability of to under in Lemma 13, we use the cavity method to prove the convergence of the moments of the overlap. In its essence, the cavity method amounts to isolating one variable from the system and analyzing the influence of the rest of the variables on it. It was initially introduced as an analytic tool, alternative to the replica method, to solve certain models of spin glasses (Mézard et al., , 1990), and has since been tremendously successful in predicting the behavior of many mean-field models. The underlying principle is know as the leave-one-out method in statistics. In our setting, this principle is materialized in the form of an interpolation method (yet again) that separates the last variable from the rest.
Our proofs of Theorems 7 and 8 are interlaced. The skeleton of the argument is as follows:
-
1.
We first prove convergence of the second moment: .
-
2.
We then deduce from 1. the convergence of the fourth moment via an inductive argument: . This finishes the proof of Theorem 7.
-
3.
Using 2., we revisit our proof of 1., and refine the estimates in order to obtain the sharper result . This finishes the proof of Theorem 8.
We will start by defining our interpolating Hamiltonian and state some preliminary bounds and properties. Then we will move on to the execution of the cavity computations.
5.1 Preliminary bounds
In this section the parameter is fixed and treated as a constant. We consider the Hamiltonian
where we have striped away the contribution of the variable from (equ. (22)). This contribution is considered separately: for , we let
We let and let our interpolation, with the time parameter , be
At we have , and at the variable decouples from the rest of the variables. For an integer and , we define
similarly to (21). Following Talagrand’s notation, we write
In our last notation, we have only emphasized the dependence of the average on ; the parameter will henceforth remain fixed. Moreover, we write for . The following three lemmas are variants of Lemma 1.6.3, Lemma 1.6.4 and Proposition 1.8.1 respectively in (Talagrand, 2011a, ).
Lemma 15.
For all ,
where we have written .
Proof.
The computation relies on Gaussian integration by parts. See (Talagrand, 2011a, , Lemma 1.6.3) for the details of a similar computation.
Lemma 16.
If is a bounded non-negative function, then for all ,
Proof.
Since the variables and the overlaps are all bounded, and , using Lemma 15 we have for all
Then we conclude using Grönwall’s lemma.
Lemma 17.
For all , and all such that ,
| (38) | ||||
| (39) |
5.2 The cavity matrix
Recall the parameters , and from (11):
where . Now let
| (40) |
One can easily check that the transpose of this matrix has two eigenvalues and with expressions
| (41) |
and associated eigenvectors and , and of multiplicities two and one respectively. (The first eigenvalue appears in a Jordan block.) We will need to control the largest eigenvalue of . This matrix is the “planted” analogue of the one displayed in (Talagrand, 2011a, , equ. (1.234)) for the SK model. By Cauchy-Schwarz, . As will be clear from the next subsection, the cavity computations we are about present are only informative when . Interestingly, this is true for all values of where the formula has two derivatives:
Lemma 18.
For all , .
Proof.
First, if , then , and . By Lemma 1, . Now we assume . Recall
and the potential
It is a straightforward exercise to compute the first and the second derivatives of using Gaussian integration by parts and the Nishimori property:
With the choice , we see that . Now we observe that
Since is a maximizer of the smooth function , and lies in the interior of its domain ( for ), then it must be a first-order stationary point: . Hence , i.e., for all . Now we claim that the inequality must be strict for . Indeed, (Lelarge and Miolane, , 2016, Proposition 15) show that whenever is differentiable at , then the maximizer of is unique and
Therefore, twice differentiability of implies first differentiability of whenever (i.e., ). Now we take advantage of first-order optimality: is the same as
The above can be seen as an equality of functions (of ) defined almost everywhere. Taking one derivative yields
Since and are both positive, the right-hand side cannot vanish. This concludes the proof.
5.3 Cavity computations for the second moment
In this subsection we prove the convergence of the second moment of the overlaps:
with as when and is symmetric about the origin, and uniformly lower-bounded by a positive constant otherwise. To lighten the notation in the calculations to come, will be denoted simply by , and we recall the notation . Let
By symmetry between sites,
By the first bound (38) of Lemma 17 with , , we get
with . On the other hand, by the second bound (39) with , , we get
This is because , since last variable decouples from the remaining variables under the measure . Now, we use Lemma 15 with , to evaluate the above derivative at . We still write .
We extract the average on the -variables from the rest of the expression as pre-factors, so that the above is equal to
We notice that by the Nishimori property that
Now we observe that is a linear combination of terms that resemble, but are not quite equal to , and . We are nevertheless tempted to make the substitution since we expect them to be close. We use Lemma 17 to justify this. Taking as an example, we apply the estimate (38) with , and . We get
with . Moreover,
The third term is of order , and the second term is bounded by . Therefore
with
This argument applies equally to the remaining terms and . We then end up with the identity
| (42) |
where , and is bounded by the same quantity as .
Next, we apply the same reasoning to and as well, (e.g., Lemma 15 needs to applied with for and for ) we get
| (43) | ||||
| (44) |
where for ,
| (45) |
We have ended up with a linear system in the quantities , and . Let and . Then the equations (42), (43) and (44) can be written as
| (46) |
where , and the matrix are defined in (40). The above system implies useful bounds on the coefficients of the vector only if the largest eigenvalue of the matrix is smaller than 1. This is insured by Lemma 18 when (independently of ). Now we can invert the linear system and extract :
| (47) |
Now we need to control the entries of . By elementary manipulations,
and
Therefore, from (45) we have for all ,
| (48) |
Now we will argue that and . With Lemma 13 we have for
and
Combining the above two bounds with (48), and then injecting in (47), we get
The symbols and refer to the norm of a vector and the matrix operator norm respectively. Here, . Note that the matrix inverses are bounded even as since for . We choose small enough and large enough that . We therefore get
5.4 Cavity computations for the fourth moment
In this subsection we prove the convergence of the fourth moment:
where is of the same type as before. We adopt the same technique based on the cavity method, with the extra knowledge that the second moment converges. Many parts of the argument are exactly the same so we will only highlight the main novelties in the proof. Let
By symmetry between sites,
The quadratic term is bounded as
The last inequality is using our extra knowledge about the convergence of the second moment. The last two terms are also bounded by and respectively. Now we must deal with the cubic term, and here, we apply the exact same technique used to deal with the term in the previous proof. The argument goes verbatim. Then we equally treat the terms and . We end up with a similar linear system relating , and :
where . The differences with the earlier linear system (46) are in the vector of coefficients (that could be determined from the recursions) and the error terms , which are now bounded as
Crucially, the matrix remains the same. Using Lemma 13, we have for ,
With the bound we already have on , we finish the argument in the same way, by choosing sufficiently small. This concludes the proof of Theorem 7.
5.5 Sharper results: the asymptotic variance
Finally, given the convergence of the fourth moment, we can refine the convergence result of the second moment. Indeed we are now able to compute the limit of . Using Jensen’s inequality on the second and fourth moment, we have
Looking back at (48), we can now assert that
We plug this new knowledge in (47), and obtain
The last line follows since for . We have just proved that
where . One can solve this linear system explicitly and obtain the expression of the coordinates of :
The expression of the first coordinate defines , equ.(27). This concludes the proof of Theorem 8.
5.6 Proof of Lemma 9
6 Proof of Theorem 4
In this section we prove a slightly stronger result than convergence in distribution. We prove the convergence of all moments with an explicit rate of . Statement is deduced effortlessly form a classical result while statement requires more work. We start with the former.
6.1 Fluctuations under : the ALR CLT
We assume in this subsection that , and let , i.e, i.i.d. We then see that the likelihood ratio is related to the partition function of the Sherrington–Kirkpatrick model via a trivial relation:
where we have let . is the partition function of the SK model at inverse temperature . It is easy to compute the expectation of :
so that is the gap between the free energy of the SK model and its annealed version. The question of determining the values of the inverse temperature for which this gap is zero (or constant), i.e., at what temperatures is the free energy given by the annealed computation? is a central question in statistical physics. Aizenman, Lebowitz and Ruelle (ALR) proved that in the high-temperature regime , converges in distribution the normal law
In our notation this simply means
under where . The ALR proof is combinatorial and uses so-called cluster expansion techniques. It may not extend to other types of priors. Alternative proofs were subsequently found by adopting different perspectives on the problem (Comets and Neveu, , 1995; Guerra and Toninelli, 2002a, ). A more recent proof based on the cavity method is provided by Talagrand in his second book (Talagrand, 2011b, , Section 11.4). His method provides an explicit (and optimal) rate of convergence of the moments of the random variable in question to those of the Gaussian. In what follows we use Talagrand’s approach to prove a similar central limit theorem when , for an arbitrary bounded prior . In this more general setting, the high temperature region of the model is given by the condition .
6.2 Fluctuations under : a planted version of the ALR CLT
Let , and . We define the random variable
where
We will prove that the integer moments of converge to those of the Gaussian with variance . This is a sufficient condition for convergence in distribution to hold, since the Gaussian is uniquely determined by its moments.
Theorem 19.
For all and integers , there exists a constant such that
where is the -th moment of the standard Gaussian .
This theorem mirrors Theorem 11.4.1 in (Talagrand, 2011b, ), and our approach is inspired by his. We define the function
Lemma 20.
For all ,
| (49) |
Proof.
This is by simple differentiation and regrouping of terms.
The derivative involves averages of the form
for . In the first line of (20), we see that the planted term has a pre-factor twice as big the that of the replica term . This is the reason the mean of the limiting Gaussian is and not in the planted case. A crucial step in the argument is to show that and its pre-factor in the above expression are asymptotically uncorrelated, so that one can split the expectation:
Proposition 21.
For all and integers , and , we have
where .
Proof of Theorem 19. Proposition 21 implies
Notice that with our choice of the function , the first term on the right-hand side vanishes. (Setting this term to zero provides another way of discovering the function .) Now we let . We have for all and all
| (50) |
By induction, and since , we see that for all even
where and . The last recursion defines the sequence of even Gaussian moments. As for odd values of , we have already proved in Corollary 3 that
We use induction again on (50) to conclude that for all odd ,
6.3 Proof of Proposition 21
The argument is in two stages. We first prove that
| (51) |
and then
| (52) |
where in both cases . We once again use the cavity method to extract the last variable and analyze its influence. Let
and
We have . We consider the quantity
Our strategy is approximate by . The approach is very similar to the one used to prove optimal rates of convergence of the overlaps. By symmetry between sites, we have
Notice that since the last variables decouple from the rest of the system at , we have
The expressions of the derivatives are a bit cumbersome so we do not display them, but we will describe their main features. From here onwards, we present the proof of (51) and (52) for for concreteness. The exact same argument goes through for . The only difference is in the number of terms showing up the derivatives of , not their nature. The derivative will be a sum of different terms, all of the form
| (53) |
where and , and is a polynomial of degree in . We see that at , if the above expression involves a variable of degree 1 then this term vanishes. Therefore the only remaining term is the one where . One can verify that for this term. Therefore
| (54) |
Now we turn to . Taking another derivative generates monomials of degree three in the overlaps and the last variable, so is a sum of terms of the form
| (55) |
where is a polynomial of degree in , and . Our goal is to bound the second derivative independently of , so that we are able to use the Taylor approximation
| (56) |
Since prior has bounded support, Hölder’s inequality implies that (55) is bounded by
where . The last bound follows from Jensen’s inequality (since ) and another application of Hölder’s inequality. We let and . Using a straightforward analogue of Lemma 16 for the measure , and the convergence of the fourth moment, Theorem 7, we have
We use the following lemma to bound the moments of :
Lemma 22.
For all and integers , there exists a constant such that for all
Proof.
Taking a derivative w.r.t. time, we have
By Hölder’s inequality and boundedness of the variables and overlaps,
The first term is generated by the terms involving in the derivative, and the second term comes from the one involving . Since is even, we drop the absolute values on the right-hand side. Next, use the fact for all and , then we use Grönwall’s lemma to conclude.
Therefore by the above estimates we have
| (57) |
Now, our next goal is to prove
| (58) |
We consider the function (this should come with no surprise at this point)
Observe that (54) tells us . On the other hand,
Using Lemma 22 and Hölder’s inequality, the first term is bounded by , and the second term is bounded by . So it suffices to show that
Similarly to , the derivative of is a sum of terms of the form
It is clear that the same method used to bound (the generic term of which is (55)) also works in this case, so we obtain the desired bound on . Finally, using (56), (57) and (58), we obtain
where . This is equivalent to (51) and closes the first stage of the argument. Now we need to show that
The argument has become a routine by now: we consider the function
We have
The derivative of is a sum of term of the form
By our earlier argument, for all . We similarly argue that for all , so that
This yields (52) and thus concludes the proof.
Appendix A Appendix
Here, we prove Lemma 10. A straightforward calculation reveals that
so that is Lipschitz and strongly convex on any interval, and for all .
Let , and let be the symmetric part of , i.e., for all Borel . We observe that is absolutely continuous with respect to , so that the Radon-Nikodym derivative is a well-defined measurable function from to that integrates to one.
Proposition 23.
For all , we have
where is the average w.r.t. to the Gibbs measure corresponding to the Gaussian channel , and . Moreover, if , the right-hand side of the above inequality is zero if and only if , i.e., the prior is symmetric.
Finally, the last statement is given here.
Lemma 24.
The map is increasing on .
Proof.
This is a matter of showing that the derivative of the above function is non-negative. By standard manipulations (Gaussian integration by parts, Nishimori property), the derivative can be written as
Proof of Proposition 23. The argument relies on a smooth interpolation method between the two measures and . Let and let . Further, let be fixed, and
where . Now let
We have on the one hand, and since is a symmetric distribution, on the other. We will show that is a convex increasing function on the interval , strictly so if , and that . Then we deduce that , allowing us to conclude. First, we have
and
Similar expressions holds for where is replaced by inside the exponentials. We see from the expression of the first derivative at that . This is because is symmetric about the origin, so a sign change (of for the first term, and for the second term in the expression) does not affect the value of the integrals. Hence . Now, we focus on the second derivative. Observe that since is the symmetric part of , is anti-symmetric. This implies that the first term in the expression of the second derivative changes sign under a sign change in , and keeps the same modulus. As for the second term, a sign change in induces integration against . Hence we can write the difference as
For any Borel , we have . Therefore the second term in the above expression becomes
Since both and are absolutely continuous with respect to for all we write
where the Gibbs average is with respect to the posterior of given under the Gaussian channel , and the expectation is under and . By the Nishimori property, we simplify the above expression to
where the expression is valid for all . From here we see that the function is convex on (where we have closed the right end of the interval by continuity). Since , is also increasing on . Therefore we have
Now it remains to show that if for some then . By the lower bound we have shown, equality of and would imply
for (Lebesgue-)almost all and -almost all . We make the change of variable and complete the squares, then the above is equivalent to
for almost all . The above expressions are convolutions of the measures and against the Gaussian kernel. By taking the Fourier transform on both sides and using Fubini’s theorem, we get equality of the characteristic functions of and : for all ,
This is because the Fourier transform of the Gaussian (another Gaussian) vanishes nowhere on the real line, thus it can be simplified on both sides. This of course implies that , and concludes our proof.
Acknowledgments. AE is grateful to Léo Miolane for insightful conversations. This research effort was initiated at the Workshop on Statistical physics, Learning, Inference and Networks at Ecole de Physique des Houches, winter 2017. FK acknowledges funding from the EU (FP/2007-2013/ERC grant agreement 307087-SPARCS). MJ acknowledges the support of the Mathematical Data Science program of the Office of Naval Research under grant number N00014-15-1-2670.
References
- Aizenman et al., (1987) Aizenman, M., Lebowitz, J. L., and Ruelle, D. (1987). Some rigorous results on the Sherrington–Kirkpatrick spin glass model. Communications in Mathematical Physics, 112(1):3–20.
- Aizenman et al., (2003) Aizenman, M., Sims, R., and Starr, S. L. (2003). Extended variational principle for the Sherrington-Kirkpatrick spin-glass model. Physical Review B, 68(21):214403.
- Aleskandrov, (1939) Aleskandrov, A. (1939). Almost everywhere existence of the second differential of a convex function and some properties of convex functions. Leningrad Univ. Ann., 37:3–35.
- Amini and Wainwright, (2009) Amini, A. A. and Wainwright, M. J. (2009). High-dimensional analysis of semidefinite relaxations for sparse principal components. Annals of Statistics, 37(5B):2877–2921.
- Baik et al., (2005) Baik, J., Arous, G. B., Péché, S., et al. (2005). Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices. Annals of Probability, 33(5):1643–1697.
- Baik and Lee, (2016) Baik, J. and Lee, J. O. (2016). Fluctuations of the free energy of the spherical Sherrington–Kirkpatrick model. Journal of Statistical Physics, 165(2):185–224.
- Baik and Lee, (2017) Baik, J. and Lee, J. O. (2017). Fluctuations of the free energy of the spherical Sherrington–Kirkpatrick model with ferromagnetic interaction. In Annales Henri Poincaré, volume 18, pages 1867–1917. Springer.
- Baik and Silverstein, (2006) Baik, J. and Silverstein, J. W. (2006). Eigenvalues of large sample covariance matrices of spiked population models. Journal of Multivariate Analysis, 97(6):1382–1408.
- Banks et al., (2017) Banks, J., Moore, C., Vershynin, R., Verzelen, N., and Xu, J. (2017). Information-theoretic bounds and phase transitions in clustering, sparse PCA, and submatrix localization. In IEEE International Symposium on Information Theory (ISIT), pages 1137–1141. IEEE.
- Barbier et al., (2016) Barbier, J., Dia, M., Macris, N., Krzakala, F., Lesieur, T., and Zdeborová, L. (2016). Mutual information for symmetric rank-one matrix estimation: A proof of the replica formula. In Advances in Neural Information Processing Systems (NIPS), pages 424–432.
- Benaych-Georges and Nadakuditi, (2011) Benaych-Georges, F. and Nadakuditi, R. R. (2011). The eigenvalues and eigenvectors of finite, low rank perturbations of large random matrices. Advances in Mathematics, 227(1):494–521.
- Berthet and Rigollet, (2013) Berthet, Q. and Rigollet, P. (2013). Optimal detection of sparse principal components in high dimension. Annals of Statistics, 41(4):1780–1815.
- Boucheron et al., (2013) Boucheron, S., Lugosi, G., and Massart, P. (2013). Concentration inequalities: A nonasymptotic theory of independence. Oxford university press.
- Capitaine et al., (2009) Capitaine, M., Donati-Martin, C., and Féral, D. (2009). The largest eigenvalues of finite rank deformation of large wigner matrices: convergence and nonuniversality of the fluctuations. Annals of Probability, pages 1–47.
- Comets and Neveu, (1995) Comets, F. and Neveu, J. (1995). The Sherrington-Kirkpatrick model of spin glasses and stochastic calculus: the high temperature case. Communications in Mathematical Physics, 166(3):549–564.
- Deshpande et al., (2016) Deshpande, Y., Abbe, E., and Montanari, A. (2016). Asymptotic mutual information for the binary stochastic block model. In IEEE International Symposium on Information Theory (ISIT), pages 185–189. IEEE.
- Deshpande and Montanari, (2014) Deshpande, Y. and Montanari, A. (2014). Information-theoretically optimal sparse PCA. In IEEE International Symposium on Information Theory (ISIT), pages 2197–2201. IEEE.
- Dobriban, (2017) Dobriban, E. (2017). Sharp detection in PCA under correlations: all eigenvalues matter. Annals of Statistics, 45(4):1810–1833.
- Féral and Péché, (2007) Féral, D. and Péché, S. (2007). The largest eigenvalue of rank one deformation of large Wigner matrices. Communications in Mathematical Physics, 272(1):185–228.
- Franz and Parisi, (1995) Franz, S. and Parisi, G. (1995). Recipes for metastable states in spin glasses. Journal de Physique I, 5(11):1401–1415.
- Guerra, (2001) Guerra, F. (2001). Sum rules for the free energy in the mean field spin glass model. Fields Institute Communications, 30:161–170.
- Guerra, (2003) Guerra, F. (2003). Broken replica symmetry bounds in the mean field spin glass model. Communications in Mathematical Physics, 233(1):1–12.
- (23) Guerra, F. and Toninelli, F. L. (2002a). Central limit theorem for fluctuations in the high temperature region of the Sherrington–Kirkpatrick spin glass model. Journal of Mathematical Physics, 43(12):6224–6237.
- (24) Guerra, F. and Toninelli, F. L. (2002b). The thermodynamic limit in mean field spin glass models. Communications in Mathematical Physics, 230(1):71–79.
- Johnstone, (2001) Johnstone, I. M. (2001). On the distribution of the largest eigenvalue in principal components analysis. Annals of Statistics, pages 295–327.
- Johnstone and Lu, (2009) Johnstone, I. M. and Lu, A. Y. (2009). On consistency and sparsity for principal components analysis in high dimensions. Journal of the American Statistical Association, 104(486):682–693.
- Korada and Macris, (2009) Korada, S. B. and Macris, N. (2009). Exact solution of the gauge symmetric p-spin glass model on a complete graph. Journal of Statistical Physics, 136(2):205–230.
- Krzakala et al., (2016) Krzakala, F., Xu, J., and Zdeborová, L. (2016). Mutual information in rank-one matrix estimation. In Information Theory Workshop (ITW), pages 71–75. IEEE.
- Le Cam, (1960) Le Cam, L. (1960). Locally Asymptotically Normal Families of Distributions. Certain Approximations to Families of Distributions and Their Use in the Theory of Estimation and Testing Hypotheses. Berkeley & Los Angeles.
- Lelarge and Miolane, (2016) Lelarge, M. and Miolane, L. (2016). Fundamental limits of symmetric low-rank matrix estimation. arXiv preprint arXiv:1611.03888.
- Lesieur et al., (2015) Lesieur, T., Krzakala, F., and Zdeborová, L. (2015). Phase transitions in sparse PCA. In IEEE International Symposium on Information Theory (ISIT), pages 1635–1639. IEEE.
- Lesieur et al., (2017) Lesieur, T., Krzakala, F., and Zdeborová, L. (2017). Constrained low-rank matrix estimation: Phase transitions, approximate message passing and applications. arXiv preprint arXiv:1701.00858.
- Mézard et al., (1990) Mézard, M., Parisi, G., and Virasoro, M.-A. (1990). Spin glass theory and beyond. World Scientific Publishing.
- Nadler, (2008) Nadler, B. (2008). Finite sample approximation results for principal component analysis: A matrix perturbation approach. Annals of Statistics, pages 2791–2817.
- Nishimori, (2001) Nishimori, H. (2001). Statistical physics of spin glasses and information processing: an introduction, volume 111. Clarendon Press.
- Onatski et al., (2013) Onatski, A., Moreira, M. J., and Hallin, M. (2013). Asymptotic power of sphericity tests for high-dimensional data. Annals of Statistics, 41(3):1204–1231.
- Onatski et al., (2014) Onatski, A., Moreira, M. J., and Hallin, M. (2014). Signal detection in high dimension: The multispiked case. Annals of Statistics, 42(1):225–254.
- Paul, (2007) Paul, D. (2007). Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Statistica Sinica, pages 1617–1642.
- Péché, (2006) Péché, S. (2006). The largest eigenvalue of small rank perturbations of Hermitian random matrices. Probability Theory and Related Fields, 134(1):127–173.
- Péché, (2014) Péché, S. (2014). Deformed ensembles of random matrices. In Proceedings of the International Congress of Mathematicians, Seoul, pages 1059–1174. ICM.
- Perry et al., (2016) Perry, A., Wein, A. S., Bandeira, A. S., and Moitra, A. (2016). Optimality and sub-optimality of PCA for spiked random matrices and synchronization. arXiv preprint arXiv:1609.05573.
- Talagrand, (2006) Talagrand, M. (2006). The Parisi formula. Annals of Mathematics, pages 221–263.
- (43) Talagrand, M. (2011a). Mean field models for spin glasses. Volume I: Basic examples, volume 54. Springer Science & Business Media.
- (44) Talagrand, M. (2011b). Mean field models for spin glasses. Volume II: Advanced replica-symmetry and low temperature, volume 55. Springer Science & Business Media.
- Van der Vaart, (2000) Van der Vaart, A. W. (2000). Asymptotic statistics (Cambridge series in statistical and probabilistic mathematics). Cambridge University Press.
- van Handel, (2014) van Handel, R. (2014). Probability in high dimension. Technical report, Princeton University, NJ.